_________________________________________________

PPOL 670 | Final Project

##Load Libraries

The question that lies at the heart of our analysis is: Using a small number of continuous variables, how accurately can our specified model predict median household income in United States Census Tracts? We believe this question to be critical in helping define salient predictors of the amount of wealth households in America are capable of accumulating. As of February 2022 per the Center of Poverty and Social Policy in Columbia University, poverty rates in the US were around 14.4%, with the phase out of Child Tax Credit (CTC) payments hastening the increase in childhood poverty rates. Thus, we believe that our question poses important implications to both short-term concerns related to the quality of life of American families as well as long-term indicators of economic growth and progress in the country at large. And, perhaps if certain unapparent variables demonstrate to have some level of predictive power over the outcome of median household income, digestable changes can be made in the lives of everyday Americans with serve to improve their lives.

To put this question to test, we adopt a combination of supervised and unsupervised machine learning methods, as well as various Exploratory Data Analysis, API Queries, and Geospatial tools to guide our data collection and decisionmaking processes. Git and its version control capabilities were also crucial to our ability to work effectively in tandem as a group.The datasets we invoke for our analysis are the American Community Survey collected from the years of 2013-2017 reflecting data on all Census Tracts in all 50 states, as well as the Childhood Opportunity Index Dataset, which gives insight on the neigborhood conditions (ie., access to recereational services andclean natural resources) of children in all census tracts.

##Census API Query Method 1 and Column Name Clean-Up

We start the exploration of our question by gathering ACS data on various variables of interest. Among these are the aggregate time spent traveling to work by individuals in a given tract, the total count of females, the number of computers, the number of individuals who have completed high school, the number of people with a GED, the number of people who have completed an associate’s degree, and the number of people who have completed a bachelor’s degree.

#Loading in API, gathering data --------------------
#Census api key
#Sys.getenv("CENSUS_API_KEY")

#Loading in state variable list
states = c("CA", "OR", "WA", "NV", "AZ", "CO", "ID", "MT", "WY", "NM", "AK", "HI","UT",
           "ND", "SD", "NE", "KS", "MN", "IA", "MO", "WI", "IL", "IN", "MI", "OH",
           "OK", "TX", "LA", "AR", "MS", "AL", "GA", "FL", "TN", "KY", "WV", "VA", "NC", "SC", "DE","MD",
           "ME", "NH", "VT", "MA", "CT", "RI", "NJ", "PA", "NY")

#Browse variable names 
variables_2017 <- load_variables(2017, "acs5", cache = TRUE)

#Use get_acs method to retrieve desired data on all 50 states 
states_2017 <- get_acs(
  geography = "tract",
  variables = c(total_pop = "B01003_001", #total population
                travel_min_work_agg = "B08133_001", #aggregate travel time to work in minutes
                female_count = "B01001_026", #total female
                computer_count ="B28010_001", #computers in household
                education_alloc = "B99151_001", #allocation of educational attainment (number of people who have completed HS)
                med_income = "B19013_001", #median income level in last 12 months (this is our outcome variable)
                hs_count = "B15003_017", #number of people with a HS diploma
                ged_count = "B15003_018", #number of people with a ged
                assoc_count = "B15003_021", #number of people with an associates degree
                bach_count= "B15003_022" #number of people with bachelor's degree
                ), 
  output = "wide",
  state = states,
  year = 2017) # data from years 2013 - 2017

##Read in Childhood Opportunity Index Dataset and Join Census Data with New Dataset Now, we read in the Childhood Opportunity Index Dataset. The variables of interest here are Z-scores of overall Childhood Opportunity Indices nationally-normed, Health and Environmental Indices nationally-normed, and education domain nationally normed. We want to join this with our previous datasets using GEOID as the common link in order to conduct a more robust analysis.

#reading in childhood opportunity data 
child_opp_data1 <- read_csv("data/index.csv")

#clean childhood opportunity data 
child_opp_data2 <- child_opp_data1 %>%
  na.omit(child_opp_data1) %>%
  filter(year == 2015) %>%
  select(geoid, pop, msaname15, z_ED_nat, z_HE_nat, z_COI_nat) %>%
  clean_names() %>%
  rename(GEOID = geoid)

#joining the census and COI data
states_2017 <- left_join(states_2017, child_opp_data2, by= "GEOID")

The COI aggregates multiple factors related to strong child development, such as child education (e.g., distance from early child education centers) and health (e.g., air quality and access to greenspace).

##Cleaning the states_2017 joined dataset

#this function will help us get the state from each census tract 
stateName  <- function(input) {
  tractCountyState <- unlist(strsplit(input, ", "))
  stateName <- tractCountyState[3]
  
  return(stateName)
}

#we're iterating through the states_2017 "names" column and using the function
#above to pull the state names
stateList <- map(.x = states_2017$NAME, .f = ~stateName(.x))
#we're converting the format to a tibble so we can add it to states_2017
stateTibble <- as.tibble(unlist(stateList))

#adding a state column to our census states_2017 datababy
states_2017 <- bind_cols(
  states_2017,
  stateTibble
)

#Renaming geoid and name columns 
states_2017 <- states_2017 %>%
  rename(geoid = GEOID,
         name = NAME,
         state_name = value)

#drop the E in the estimate columns
for ( col in 1:ncol(states_2017)){
    colnames(states_2017)[col] <-  sub("E.*", "", colnames(states_2017)[col])
}

#dropping error estimate columns
states_2017 <-states_2017 %>%
  select(-c("total_popM", "travel_min_work_aggM", "female_countM", "computer_countM", "computer_countM", "med_incomeM", "hs_countM",
            "ged_countM", "assoc_countM", "bach_countM"))

#removing all NAs
states_2017 <- na.omit(states_2017)

#turning all 0's to 1's for the log
states_2017[states_2017 == 0] <- 1

#creating the relative frequencies for all variables 
states_2017 <- states_2017 %>%
    mutate(
      travel_min_work_agg_prop = travel_min_work_agg/total_pop,
      female_count_prop = female_count/total_pop,
      computer_count_prop = computer_count/total_pop,
      hs_count_prop = hs_count/total_pop,
      ged_count_prop = ged_count/total_pop,
      assoc_count_prop = assoc_count/total_pop,
      bach_count_prop = bach_count/total_pop,
      education_alloc_prop = education_alloc/total_pop)

##Exploratory Data Analysis Phase 1: Understanding How Key Variables of Interest Interplay The goal of this initial EDA is to understand the distribution of American households across income brackets specified by the US News World Report linked below, as well as gain insight into features that characterize these income brackets.

The first vizualization titled “Number of People in the United States Living in Tracts with Specified Income Brackets” shows us that the number of people living in a middle class census tract is the highest, with the lower middle class bracket following behind. Bearing in mind that this does not give insight into the actual number of people in each of these income brackets, it tells us that the majority of Americans are residing in what can be known as “middle-class America.” In terms of a specific number of census tracts in each bracket, we observe that 384 fall in Poor, 2494 fall in Lower-Middle Class, 6787 fall in Middle Class, and 114 fall in upper class. Finding ways to grow the middle-class, given that nearly 3,000 tracts fall below it, is crucial.

The second vizualization (a boxplot) titled “Proportional of Females in Census Tracts by Income Bracket” shows us that the average proportion of females is roughly the same acorss each income bracket, which is telling that it will likely not be a useful predictive variable in predicting median household income levels in a given census tract. On the other hand, the average proportion of bachelor degrees appears to noticeably vary across income brackets across income brackets, with those in the upper income bracket having a significantly higher amount than those in the lower.

The final vizualization (a boxplot) titled “Proportional Agg. Time Traveled To Work in Census Tracts by Income Bracket” demonstrates that average aggregated travel time for members in tracts, taken as a proportion of the total individuals in the given tract, is highest for those in the upper income level and lowest for those in the poorest income level. This tell us that those who are poor likely lack the facilities to travel for work, and therefore take jobs closer to their homes that would necessitate a lesser commute.

Broadly speaking, this initial phase of analysis tell us that transporation and education related variables are more important than gender related variables in determining income outcomes.

##EDA Part 1

#finding number of people in different census tract income brackets
states_2017_EDA <- states_2017 %>% 
  mutate(
    income_bracket = case_when(
    med_income <= 32058 ~ "Poor-Near Poor",
    med_income <= 53413 ~ "Lower-Middle Class", 
    med_income <= 156000 ~ "Middle Class", 
    TRUE ~ "Upper") 
  )

#creating a barplot depicting distribution of people across income brackets
income_bracket_viz <- states_2017_EDA %>%
  ggplot(mapping = aes(x = income_bracket, y = total_pop, color = income_bracket)) + 
  geom_col() +
  labs(
    title = "Number of People in the United States in Tracts by Specified Income Brackets", 
    subtitle = "Information on the Relative Sizes of Income Brackets", 
    x = "Income Bracket", 
    y = "Number of People", 
    caption = "ACS 5 Year Survey"
  ) + 
   scale_x_discrete(limits = c("Poor-Near Poor", "Lower-Middle Class", "Middle Class", "Upper"))

income_bracket_viz 

#finding number of census tracts in specified income bracket
income_bracket_table <- states_2017_EDA %>%
  mutate(poor = if_else(income_bracket == "Poor-Near Poor", 1, 0)) %>% #gen dummy variable
  mutate(lower_middle =if_else(income_bracket == "Lower-Middle Class", 1, 0)) %>%
  mutate(middle = if_else(income_bracket == "Middle Class",1, 0)) %>%
  mutate(upper = if_else(income_bracket == "Upper", 1, 0)) %>%
  summarize(
  poor = sum(poor == 1),
  lower_middle = sum(lower_middle == 1),
  middle = sum(middle == 1),
  upper = sum(upper == 1)
 )

income_bracket_table
## # A tibble: 1 × 4
##    poor lower_middle middle upper
##   <int>        <int>  <int> <int>
## 1   384         2494   6787   114
#creating a boxplot showing proportion of females in specified income brackets
female_count_viz <- ggplot(states_2017_EDA, aes (x = female_count_prop, y = income_bracket, color = income_bracket)) + 
  geom_boxplot() + 
   labs(
    title = "Proportional of Females in Census Tracts by Income Bracket", 
    subtitle = "Gender Relevant Information on Income Brackets", 
    x = "Proportion of Females", 
    y = "Income Bracket", 
    caption = "ACS 5 Year Survey"
  ) + 
  scale_y_discrete(limits = c("Poor-Near Poor", "Lower-Middle Class", "Middle Class", "Upper")) 
  
female_count_viz

#creating a boxplot showing proportion of bachelor degree attainment in specified income brackets
bach_count_viz <- ggplot(states_2017_EDA, aes (x = bach_count_prop, y = income_bracket, color = income_bracket)) + 
  geom_boxplot() + 
   labs(
    title = "Proportional of Bachelor Degrees in Census Tracts by Income Bracket", 
    subtitle = "Education Relevant Information on Income Brackets", 
    x = "Proportion of Bachelor Degrees", 
    y = "Income Bracket", 
    caption = "ACS 5 Year Survey"
  ) + 
  scale_y_discrete(limits = c("Poor-Near Poor", "Lower-Middle Class", "Middle Class", "Upper")) 
  
bach_count_viz

#creating a boxplot depicting travel times to work by income bracket
travel_time_viz <- ggplot(states_2017_EDA, aes (x = travel_min_work_agg_prop, y = income_bracket, color = income_bracket)) + 
  geom_boxplot() + 
   labs(
    title = "Proportional Agg. Time Traveled To Work in Census Tracts by Income Bracket", 
    subtitle = "Transportation Relevant Information on Income Brackets", 
    x = "Proportional Aggregate Time Traveled", 
    y = "Income Bracket", 
    caption = "ACS 5 Year Survey"
  ) + 
  scale_y_discrete(limits = c("Poor-Near Poor", "Lower-Middle Class", "Middle Class", "Upper")) 
  
travel_time_viz

https://money.usnews.com/money/personal-finance/family-finance/articles/where-do-i-fall-in-the-american-economic-class-system

##Exploratory Data Analysis Phase 2: Understanding Variable Importance The grid of each potential variable displays possible values of the variable and the number of census tracts at each value. The goal is to determine the potential distribution of each variable. As the grid indicates, each variable is right-skewed - median is less the mode and there is a long tail of outliers.

The second visualization examines the correlation of the variables with one another. The only variable with a concerningly high correlation is total population. It is extremely correlated (r>.90) with total females and total computers. Total population and total female population are directly connected as for every increase in total female population there will be an increase in total population (the reverse is not necessarily true). Due to technological advancements and decreases in price, computers are widely available. Many jobs require a computer as well as school assignments, so the correlation with population and total computers is consistent. Otherwise, no correlation amongst the other variables is concerning.

Due to the skew of the variables and the correlation amongst total population, manipulating this variable to yield relative frequencies would elminate these issues.

##EDA Part 2

#Distribution of Variables ------------
summary(states_2017$med_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7232   50773   66351   71425   87361  250001
p1 <- states_2017 %>%
  ggplot(aes(med_income)) +
  geom_freqpoly(binwidth = 1000) + 
  labs(x = "Median Income",
       y = "Count",
  )

summary(states_2017$travel_min_work_agg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6730   46820   64615   72051   87520  973160
p3 <- states_2017 %>%
  ggplot(aes(travel_min_work_agg)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total Seconds Traveled to Work",
       y = "Count",
  )

summary(states_2017$computer_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      71    1358    1733    1866    2194   18670
p4 <- states_2017 %>%
  ggplot(aes(computer_count)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Computer Count")

summary(states_2017$female_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     481    1950    2500    2734    3229   33262
p5 <- states_2017 %>%
  ggplot(aes(female_count)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total Females")

summary(states_2017$total_pop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     911    3849    4944    5384    6360   65528
p6 <- states_2017 %>%
  ggplot(aes(total_pop)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total Population")

summary(states_2017$hs_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0   498.5   729.0   795.3  1012.5  6816.0
p7 <- states_2017 %>%
  ggplot(aes(hs_count)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total High School Degrees")

summary(states_2017$ged_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      53      95     118     155    2037
p8 <- states_2017 %>%
  ggplot(aes(ged_count)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total GEDs")

summary(states_2017$assoc_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     180     272     308     387    2795
p9 <- states_2017 %>%
  ggplot(aes(assoc_count)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total Associate's Degree")


summary(states_2017$bach_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   372.0   635.0   743.3   974.0 13767.0
p10 <- states_2017 %>%
  ggplot(aes(bach_count)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total Bacehlor's Degrees")


summary(states_2017$education_alloc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      96    2598    3311    3572    4224   37832
p11 <- states_2017 %>%
  ggplot(aes(education_alloc)) +
  geom_freqpoly(binwidth = 1000) +
  labs(x = "Total Education Allotment")

grid.arrange(p1, p3, p4, p5, p6, p7, p8, p9, p10, p11, nrow = 3, top = "Distribution of Variables") 

#Correlation of Variables--------------

grid_corr <- function(x, y){
    corr <- par("corr"); on.exit(par(corr))
    par(corr = c(0, 1, 0, 1))
    r <- round(cor(x, y), digits=2)
    txt <- paste0("R = ", r)
    text(1000, 1000,  txt)
}
upper_grid<-function(x, y){
  points(x,y, pch = 19)
}
pairs(data = states_2017,
      ~med_income + total_pop + travel_min_work_agg + female_count + computer_count + hs_count + ged_count + assoc_count + bach_count + education_alloc,
      lower.panel = grid_corr,
      upper.panel = upper_grid)

#here's the dataframe we'll use for the model
only_numb <- states_2017 %>%
   select(med_income, total_pop, ends_with("prop"), starts_with("z"))
# library(Hmisc)
# tiff("test.tiff", units="in", width=5, height=5, res=300)
#it looks like we should center some variables, but not log everything
#we need to log GED
# hist.data.frame(only_numb)
# dev.off()

eda_variables <- knitr::include_graphics("images/greatestFig")
eda_variables

##Preparing for Supervised Machine Learning Models Now, our aim is to run various supervised machine learning models in order to gauge the efficacy of our specified predictors in determining median household income levels. In order to do so, we rely upon v-fold cross validation techniques and a recipe that is carefully crafted to match the needs and limitations of our data.

#Set up a testing environment
#set the seed and split the data
set.seed(20211101)
states_2017_split <- initial_split(states_2017, prop = .8)
states_2017_train1 <- training(states_2017_split)
states_2017_test <- testing(states_2017_split)

#here's the dataframe we'll use for the model
states_2017_train <- states_2017_train1 %>%
  select(med_income, total_pop, ends_with("prop"), starts_with("z"))

only_numb <- states_2017 %>%
   select(med_income, total_pop, ends_with("prop"), starts_with("z"))

#resampling with 10 folds
folds <- vfold_cv(data = states_2017_train, v = 10)

#crafting recipe
states_2017_rec <- recipe(med_income ~ ., data = states_2017_train) %>%
  #removes rows with nans
  step_naomit(all_predictors()) %>%
  #removes variables that are sparce and unbalanced
  step_nzv(all_predictors()) %>%
  step_filter_missing(all_predictors()) %>%
  #logs ged since it's skewed
  step_log(ged_count_prop) %>%
  #centers our variables but not the z-scored ones
  step_center(all_predictors(), -starts_with("z_")) 

##Linear Regression Model We begin with a linear regression model. Our RMSE from this model is reasonably high at around 13,370, which given the range of our median household income level data from 7232 to 250001, is not extremely high but certainly higher than we would like to observe. This means that the error in the model in predicting our data is by 13,370 dollars, which is not necessarily indicative of the type of predictive robustness we would like to observe.In this model, the variable z_coi_nat appears to have the highest level of importance, with computer_count_prop falling closely behind.

#setting the linear reg model
lm_mod <- linear_reg() %>%
  set_engine(engine = "lm") %>%
  set_mode(mode = "regression")

#workflow
lm_wf <- workflow() %>%
  add_recipe(recipe = states_2017_rec) %>%
  add_model(spec = lm_mod)

#resampling using the 10-v-fold process specified above
lm_cv <- lm_wf %>%
  fit_resamples(resamples = folds)

#using RMSE to choose the best model
lm_best <- lm_cv %>%
  select_best("rmse")

#finalize the workflow using the established workflow and the best model (according to RMSE)
lm_final <- lm_wf %>%
  finalize_workflow(parameters = lm_best)

#fitting the workflow with the original data
lm_fit <- lm_final %>%
  fit(states_2017_train)

#RMSE looks consistent across folds
collect_metrics(lm_cv, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = id, y = .estimate, group = .estimator, label = round(.estimate, digits = 2))) +
  geom_line() +
  geom_point() +
  geom_label(nudge_y = -500) +
  scale_y_continuous(limits = c(10000, 17500)) +
  labs(title = "Median Income: Linear Regression Model (RMSE Across 10 Folds)", 
       y = "Predicted RMSE",
       x = "Fold Number") +
  theme_minimal()

#mean RMSE across the 10 samples
collect_metrics(lm_cv) %>%
  filter(.metric == "rmse") %>%
  select(mean) %>%
  pull(mean)
## [1] 13730.82
lm_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

summary(states_2017$med_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7232   50773   66351   71425   87361  250001

##K nearest neighbors model We follow with a K nearest neighbors model to see if a non-parametric approach will yield a stronger RMSE value. However, our RMSE from this model is higher than the the RMSE yielded from the linear model - at around 13,425. This value is also not ideal.

knn_mod <- nearest_neighbor(neighbors = 5) %>%
  set_engine(engine = "kknn") %>%
  set_mode(mode = "regression")

#workflow situation
knn_workflow <- workflow() %>%
  add_model(spec = knn_mod) %>%
  add_recipe(recipe = states_2017_rec)

#resampling using the 10-v-fold process specified above
knn_res <- knn_workflow %>%
  fit_resamples(resamples = folds)

#using RMSE to choose the best model
knn_best <- knn_res %>%
  select_best("rmse")

#finalize the workflow using the established workflow and the best model (according to RMSE)
knn_final <- knn_workflow %>%
  finalize_workflow(parameters = knn_best)

#fitting the workflow with the original data
knn_fit <- knn_final %>%
  fit(states_2017_train)

#applying the updated workflow to the fit across folds
knn_fit_rs <- knn_final %>%
  fit_resamples(resamples = folds)

#RMSE looks consistent across folds
collect_metrics(knn_res, summarize = FALSE) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = id, y = .estimate, group = .estimator, label = round(.estimate, digits = 2))) +
  geom_line() +
  geom_point() +
  geom_label(nudge_y = -500) +
  scale_y_continuous(limits = c(10000, 17500)) +
  labs(title = "Median Income: KNN Model (RMSE Across 10 Folds)", 
       y = "Predicted RMSE",
       x = "Fold Number") +
  theme_minimal()

#mean RMSE across the 10 samples
collect_metrics(knn_res) %>%
  filter(.metric == "rmse") %>%
  select(mean) %>%
  pull(mean)
## [1] 13425.06

##Random forest model We conclude with a Random forest model. This aproach averages the estimates from many independent trees. Specifically, this model use 1000 trees, applies a tuning grid to use stronger mtry and min_n values, and uses a regression framework. The RMSE yielded this model was lowest, at 11,193 (and 5582.86 on the testing data). Although this model is our strongest, we discuss methods of improvement in the limitations section.

treeNumb <- 10
gridVal <- 10
#since random forests can be computationally heavier, we use parallel processing
cores <- parallel::detectCores()

#random forest with 1000 iterations while utilizing the parallel processing (num.threads = cores)
rf_mod1 <- rand_forest(mtry = tune(), min_n = tune(), trees = treeNumb) %>% 
  set_engine(engine = "ranger", num.threads = cores, importance = "impurity") %>%
  set_mode("regression")

#initiating the workflow for random forests
rf_workflow1 <- workflow() %>% 
  add_recipe(recipe = states_2017_rec) %>%
  add_model(spec = rf_mod1)

#tune with the given number of folds and grid-value
rf_res <- rf_workflow1 %>% 
  tune_grid(resamples = folds, grid = gridVal,
            control = control_grid(save_pred = TRUE))

#these are the best hyperparameters (mtry = , min_n = )
rf_best <- rf_res %>% 
  select_best(metric = "rmse")

#applying the best hyperparameters to our original workflow (since we used tune() as "place holders" for mtry and min_n)
rf_final <- rf_workflow1 %>%
  finalize_workflow(parameters = rf_best)

#apply the worlflow when fitting the data
rf_last_fit <- rf_final %>%
   last_fit(states_2017_split)

#fitting the workflow with the original data
rf_final_fit <- fit(rf_final, only_numb)

#applying the updated workflow to the fit
rf_fit_rs <- rf_final %>%
  fit_resamples(resamples = folds)

#mean RMSE across the 10 samples
collect_metrics(rf_fit_rs) %>%
  filter(.metric == "rmse") %>%
  select(mean) %>%
  pull(mean)
## [1] 11747.91
rf_last_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

##Prediction

#making a prediction
predictions_testing <- bind_cols(states_2017_test,
    predict(object = rf_final_fit, new_data = states_2017_test)
  )
predictions_testing
## # A tibble: 1,956 × 27
##    geoid name  total_pop travel_min_work… female_count computer_count med_income
##    <chr> <chr>     <dbl>            <dbl>        <dbl>          <dbl>      <dbl>
##  1 0607… Cens…      4592            86585         2456           1570      68375
##  2 0603… Cens…      5671            64220         3010           1799      69811
##  3 0603… Cens…      5582            77735         3117           1716      83966
##  4 0603… Cens…      5951            83790         3191           1925      67844
##  5 0603… Cens…      4623            73175         2378           1462      78900
##  6 0611… Cens…      7142            60910         3536           2504      38904
##  7 0603… Cens…      5432            95620         2488           2800      33300
##  8 0600… Cens…      4124            76545         2153           1580      56005
##  9 0600… Cens…      6231            78715         3063           1941      40989
## 10 0607… Cens…      6422            87795         3145           1873      44347
## # … with 1,946 more rows, and 20 more variables: hs_count <dbl>,
## #   ged_count <dbl>, assoc_count <dbl>, bach_count <dbl>,
## #   education_alloc <dbl>, pop <dbl>, msaname15 <chr>, z_ed_nat <dbl>,
## #   z_he_nat <dbl>, z_coi_nat <dbl>, state_name <chr>,
## #   travel_min_work_agg_prop <dbl>, female_count_prop <dbl>,
## #   computer_count_prop <dbl>, hs_count_prop <dbl>, ged_count_prop <dbl>,
## #   assoc_count_prop <dbl>, bach_count_prop <dbl>, …
#rmse for test data
rmse(data = predictions_testing, truth = states_2017_test$med_income, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       8332.

##Map figure 1

#recruiting the testing data for the map figure
predictedData <- predictions_testing %>%
  select(-name, -pop, -msaname15) %>%
  group_by(state_name) %>%
  summarise(
    mean_coi = round(mean(z_coi_nat), 3),
    mean_he = round(mean(z_he_nat), 3),
    mean_ed = round(mean(z_ed_nat), 3),
    median_inc = median(med_income),
    predicted_inc = round(median(.pred), 0)) %>%
  ungroup()

#loading in the mad that looks best and removing all population-related columns
map_US <- usa_sf("lcc") %>%
  select(-starts_with("pop")) %>%
  mutate(state_name = name)

#here we're fusung the income and state data - this won't be needed later on
map_US_inc <- left_join(map_US, predictedData, by = "state_name")

#this is where we'll put all relevant data that will appear when we hover over a state
map_US_inc <- map_US_inc %>% 
  mutate(median_income = median_inc,
         hoverText = paste0("State: ", as.character(state_name), "\n", 
                            "Median Income: $", as.character(median_inc), "\n", 
                            "Predicted Median Income: $", as.character(predicted_inc), "\n",
                            "Mean COI z-score: ", as.character(mean_coi), "\n",
                            "Mean Health z-score: ", as.character(mean_he)))

#here's the figure - we can come back and make it look prettier if needed
map1 <- map_US_inc %>%
  ggplot() +
  geom_sf_interactive(size = 0.2, color = "white", aes(fill = median_income,
    data_id = hoverText, tooltip = hoverText)) +
  theme_void() +
  theme(legend.position='none')

#this initializes the interactive bit
girafe(ggobj = map1) %>%
  girafe_options(opts_hover(css = "fill:blue;"))

##Map figure 2

map_US_Interactive2 <- map_US_inc %>%
  select(name, median_inc, predicted_inc, mean_coi, mean_he, mean_ed)

library(mapview)
map_US_Interactive2 %>%
  mapview(zcol = "median_inc", 
          map.types = c("CartoDB.Voyager","Esri.NatGeoWorldMap","Esri.WorldPhysical"))

##Unsupervised Machine Learning Models We conclude our analysis by running K-means of variables selected from our previous models that best predicted median household income. The variables we selected are computer_count, bach_count, travel_min_work_agg, and hs_count. For this method, we are simply looking to conduct K-mean clustering of all 50 states, Puerto Rico, and Washington, DC, by their principle componenents, as referenced by their principle components. We found the ideal number of clusters to be 2 from our sum of squares, silhouette, and gap-stat visualizations. Our first plot shows each of the states alongside a color matrix of their corresponding states. This is a helpful representation of the general displacement of each state. In our second plot, we see two distinct clusters, coded as red and blue. The majority of our states fall within the first cluster (red), while California, New York, Texas, and Florida fall within our second cluster (blue). As it pertains to median household income, we did not find our clustering properly reflects closesness in household median income.

url <- str_glue("https://api.census.gov/data/2017/acs/acs5?get=NAME,B28010_001E,B15003_022E,B08133_001E,B15003_017E,B19013_001E&for=state:*")
popurl <- GET(url = url)
http_status(popurl)
## $category
## [1] "Success"
## 
## $reason
## [1] "OK"
## 
## $message
## [1] "Success: (200) OK"
popurl <- content(popurl, as = "text")
pop_matrix <- fromJSON(popurl)
#turn body of character matrix into a tibble
states <- as_tibble(pop_matrix[1:nrow(pop_matrix),],.name_repair = "minimal")
names(states) <- states[1,]
states = states[-1,]

#take out na values
states[is.na(states)] <- 0

#only contains values 
drop <- c("NAME", "state", "B19013_001E")
states_numeric <- states[,!(names(states) %in% drop)]
states_numeric[1:4] <- sapply(states_numeric[1:4], as.numeric)
head(states_numeric)
## # A tibble: 6 × 4
##   B28010_001E B15003_022E B08133_001E B15003_017E
##         <dbl>       <dbl>       <dbl>       <dbl>
## 1     1103514      259993    28842875      473078
## 2     2386203      718198    63483095     1079154
## 3      419975      145089     8189310      174647
## 4      748405      251182    17166770      286241
## 5     1052249      306611    30521260      476520
## 6      526710      209224    17760430      223845
#run PCA on states_numeric
states_pca <- prcomp(states_numeric, center = TRUE, scale. = TRUE)

#how much variance explained? 
summary(states_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.9840 0.2323 0.07929 0.05804
## Proportion of Variance 0.9841 0.0135 0.00157 0.00084
## Cumulative Proportion  0.9841 0.9976 0.99916 1.00000
#extract principle components
states_pcs <- states_pca %>% 
  .$x %>% 
  as_tibble()

#combine pc's to names
states_pcs <- bind_cols(
  select(states, -state), 
  select(states_pcs, PC1, PC2)
)

#total within sum of squares to find ideal number of clusters
fviz_nbclust(states_numeric, FUN = kmeans, method = "wss")

#total silhouette width
fviz_nbclust(states_numeric, FUN = kmeans, method = "silhouette")

#gap statistic
fviz_nbclust(states_numeric, FUN = kmeans, method = "gap_stat")

#creating a function to kmeans
num_kmeans <- function(k, data){
  #generate k-means clusters
  kmeans <- kmeans(states_numeric, 
  centers = k, 
  nstart = 100
  );
#create dataframe that binds together clusters and PC's
  states_clusters <- bind_cols(
  select(states, NAME), 
  select(data, PC1, PC2), 
  cluster = kmeans$cluster
  )
}

kmeans_2 <- num_kmeans(k = 2, data = states_pcs) 
kmeans_2
## # A tibble: 52 × 4
##    NAME              PC1      PC2 cluster
##    <chr>           <dbl>    <dbl>   <int>
##  1 Mississippi   -1.03   -0.0237        1
##  2 Missouri      -0.0320 -0.160         1
##  3 Montana       -1.49    0.0813        1
##  4 Nebraska      -1.26    0.0754        1
##  5 Nevada        -1.00    0.00564       1
##  6 New Hampshire -1.36    0.115         1
##  7 New Jersey     1.13    0.0881        1
##  8 New Mexico    -1.27    0.0477        1
##  9 Alabama       -0.460  -0.122         1
## 10 Alaska        -1.60    0.106         1
## # … with 42 more rows
#graphing Kmeans with color as name
ggplot() +
  geom_point(
  kmeans_2,
  mapping = aes(PC1, PC2, color = factor(NAME)),
    alpha = 0.5
    ) +
  labs(
    title = "K-Means with K= 2 and PCA",
    x = "PC1",
    y = "PC2"
    ) +
  theme_minimal() +
  guides(text = NULL)

#graphing kmeans with label as name
ggplot(
  kmeans_2, aes(x = PC1, y = PC2, color = factor(cluster), label = NAME)) +
  geom_point() +
  geom_text(hjust = 0, vjust = 0) + 
  labs(
    title = "K-Means with K = 2 and PCA",
    x = "PC1",
    y = "PC2"
    ) +
  theme_minimal() +
  guides(text = NULL)

#summarizing by median household income
drop1 <- c("B28010_001E","B15003_022E","B08133_001E","B15003_017E")
states_med <- states %>%
  select(-c(drop1))
states_med <- states_med %>%
  rename(median_income = "B19013_001E")

states_med$median_income <- as.numeric(states_med$median_income)

states_med <- states_med[order(states_med$median_income),]

states_med
## # A tibble: 52 × 3
##    NAME           median_income state
##    <chr>                  <dbl> <chr>
##  1 Puerto Rico            19775 72   
##  2 Mississippi            42009 28   
##  3 Arkansas               43813 05   
##  4 West Virginia          44061 54   
##  5 Alabama                46472 01   
##  6 Kentucky               46535 21   
##  7 Louisiana              46710 22   
##  8 New Mexico             46718 35   
##  9 Tennessee              48708 47   
## 10 South Carolina         48781 45   
## # … with 42 more rows

The broad takeaway of our analysis is that Childhood Opportunity Indices and educational attainment indicators in adolescence and adulthood play an almost certain role in determining the median household income level of households. In a similar vein, technological indicators, like access to computers or means of transportation, also have some explanatory power over income. When applying our most robust indicators from our supervised machine learning models to an unsupervised model relating to states within the US, the extent to which the categorization of states extends to an explanation of median household income level is questionable. Thus, perhaps there are more robust indicators that would be better suited for predicting median household income, and the question of what these are would be the focus of further exploration. For the sake of this project, it is evident that a small number of continuous variables can indeed be used to predict median household income to an extent, but defining exactly what these are will require extensive research.

Through the process of data analysis, we encountered many challenges. First and foremost, parsing through the ACS dataset and obtaining variables was somewhat difficult due to the structure of the data as well as the sheer size of it. Though we had categories of variables in mind that we wanted to include in our models, it is possible that we neglected to include certain variables that would have had a more meaningful impact on our analysis. Thankfully, we partially mitigated this limitation in part by adding a second dataset to our analysis, which turned out to contain the most important predictor variable across all three supervised machine learning models. We also had to occasionally make judgement calls within our recipe on variables that would be better off logged

Furthermore, the RMSE values in our models were not as robust as we would have liked to see. Simply put, our error rates do not appear low enough to justify real-world implementation of our models. With that said, we have nonetheless gained insight into the types of variables that appear to be important in any analysis related to income.